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Abstract 

We present Quip, a lossless compression algorithm for 
next-generation sequencing data in the FASTQ and 
SAM/BAM formats. In addition to implementing 
reference-based compression, we have developed, to our 
knowledge, the first assembly-based compressor, using a 
novel de novo assembly algorithm. A probabilistic data 
structure is used to dramatically reduce the memory re- 
quired by traditional de Bruijn graph assemblers, allowing 
millions of reads to be assembled very efficiently. Read se- 
quences are then stored as positions within the assembled 
contigs. This is combined with statistical compression 
of read identifiers, quality scores, alignment information, 
and sequences, effectively collapsing very large datasets 
to less than 15% of their original size with no loss of in- 
formation. 

Availability: Quip is freely avail- 

able under the BSD license from 
http : //cs ■ Washington . edu/homes/dc j ones/quip , 

1 Introduction 

With the development of next-generation sequencing 
(NGS) technology researchers have had to adapt quickly 
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to cope with the vast increase in raw data. Experi- 
ments that would previously have been conducted with 
microarrays and resulted in several megabytes of data, 
are now performed by sequencing, producing many giga- 
bytes, and demanding a significant investment in com- 
putational infrastructure. While the cost of disk storage 
has also steadily decreased over time, it has not matched 
the dramatic change in the cost and volume of sequenc- 
ing. A transformative breakthrough in storage technol- 
ogy may occur in the coming years, but the era of the 
$1000 genome is certain to arrive before that of the $100 
petabyte hard disk. 

As cloud computing and software as a service become in- 
creasingly relevant to molecular biology research, hours 
spent transferring NGS datasets to and from off-site 
servers for analysis will delay meaningful results. More 
often researchers will be forced to maximize bandwidth 
by physically transporting storage media (via the "sneak- 
ernet"), an expensive and logistically complicated option. 
These difficulties will only be amplified as exponentially 
more sequencing data is generated, implying that even 
moderate gains in domain-specific compression methods 
will translate into a significant reduction in the cost of 
managing these massive datasets over time. 

Storage and analysis of NGS data centers primarily 



around two formats that have arisen recently as de facto 
standards: FASTQ and SAM. FASTQ stores, in addition 
to nucleotide sequences, a unique identifier for each read 
and quality scores, which encode estimates of the proba- 
bility that each base is correctly called. For its simplicity, 
FASTQ is a surprisingly ill-defined format. The closest 
thing to an accep ted specification is the description by 
Cock et al] ( 201Cll ). but the format arose ad hoc from mul- 
tiple sources (primarily Sanger and Solexa/IUumina), so 
a number of variations exist, particularly in how quality 
scores are encoded. The SAM format is far more complex 
but also more tightly defined, and comes with a referenc e 



implementation in the form of SAMtools |Li et al.l[2009l ) 



It is able to store alignment information in addition to 
read identifiers, sequences, and quality scores. SAM files, 
which are stored in plain text, can also be converted to 
the BAM format, a compressed binary version of SAM, 
which is far more compact and allows for relatively effi- 
cient random access. 

Compression of nucleotide sequences has been the tar- 
get of some interest, but compressing NGS data, made 
up of millions of short fragments of a greater whole, 
combined with metadata in the form of read identifiers 
and quality scores, presents a very different problem and 
demands new techniques. Splitting the data into sepa- 
rate contexts for read identifiers, sequences, and quality 
scores and compressing them with the Lempel-Zip algo- 
rit hm and Huffman cod ing h as been explored explored 



by Tembe et al. ( 20101 ) and Deorowicz and Grabowski 



(|201lh . who demonstrate the promise of domain-specific 
compression with significant gains over general-purpose 
programs like gzip and bzip2. 



Kozanitis et al.l (|201lf ) and iHsi-Yang Fritz et al.l (|201lr ) 
proposed reference-based compression methods, exploit- 
ing the redundant nature of the data by aligning reads to 
a known reference genome sequence and storing genomic 
positions in place of nucleotide sequences. Decompression 
is then performed by copying the read sequences from the 
genome. Though any differences from the reference se- 
quence must also be stored, referenced-based approaches 
can achieve much higher compression and they grow in- 
creasing efficient with longer read lengths, since storing 
a genomic position requires the same amount of space. 



regardless of the length of the read. 

This idea is explored also in the Goby format 
( http : //ccunpagnelab . org/sof tware/goby/p , which 
has been proposed an alternative to SAM/BAM, the 
primary functional difference being that sequences of 
aligned reads are not stored but looked up in a reference 
genome when needed (frequently they are not). For some 
applications, reference-based compression can be taken 
much further by storing only SNP information, sum- 
marizing a sequencing experiment in several megabytes 



(jChristlev et al.l I2009D . However, even when SNP calls 



are all that is needed, discarding the raw reads would 
prevent any reanalysis of the data. 

While a reference-based approach typically results in 
superior compression it has a number of disadvan- 
tages. Most evidently, an appropriate reference sequence 
database is not always available, particularly in the case 
of metagenomic sequencing. One could be contrived by 
compiling a set of genomes from species expected to be 
represented in the sample. However, a high degree of ex- 
pertise is required to curate and manage such a project- 
dependent database. Secondly, there is the practical 
concern that files compressed with a reference-based ap- 
proach are not self-contained. Decompression requires 
precisely the same reference database used for compres- 
sion, and if it is lost or forgotten the compressed data 
becomes inaccessible. 

Another recurring theme in the the growing literature 
on short read compression is lossy encoding of sequence 
quality scores. This follows naturally from the realiza- 
tion that quality scores are particularly difficult to com- 
press. Unlike read identifiers, which are highly redundant, 
or nucleotide sequences, which contain some structure, 
quality scores are inconsistently encoded between proto- 
cols and computational pipelines and are often simply 
high-entropy. It is dissatisfying that metadata (quality 
scores) should consume more space than primary data 
(nucleotide sequences). Yet, also dissatisfying to many 
researchers is the thought of discarding information with- 
out a very good understanding of its effect on downstream 
analysis. 

A number of lossy compression algorithms for 
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quality scores have 
ing various bin ning 
QScores-Archiver (|Wan et al 



been proposed, includ- 
schemes implemented in 
' iOllI ) and SCALCE 
( [iittp : //s calce . sourcef orge .net), scaling to a re- 
duced alphabet with r andomized rounding in SlimGene 
(jKozanitis et all 120111 ) , and discarding quality scores for 
bases which match a refere nce sequence in Cramtools 
(|Hsi-Yang Fritz et all boill ). In SCALCE, SlimGene, 



and Cramtool s, quality score s may also be losslessly 



compressed. iKozanitis et alj (|2011l ) analyzed of the 



effects of their algorithm on downstream analysis. Their 
results suggest that while some SNP calls are affected, 
they are primarily marginal, low-confidence calls between 
hetero- and homozygosity. 



Decreasing the entropy of quality scores while retaining 
accuracy is an important goal, but successful lossy com- 
pression demands an understanding of what is lost. For 
example, lossy audio compression (e.g. MPS) is grounded 
in psychoacoustic principles, preferentially discarding the 
least perceptible sound. Conjuring a similarly principled 
method for NGS quality scores is difhcult given that both 
the algorithms that generate them and the algorithms 
that are in formed by them are rn oving targets. In the 
analysis by IKozanitis et al.l (|201ll ) , the authors are ap- 
propriately cautious in interpreting their results, pointing 
out that "there are dozens of downstream applications 
and much work needs to be done to ensure that coarsely 
quantized quality values will be acceptable for users." 



2 Materials Methods 



2.1 Statistical Compression 

The basis of our approach is founded on statistical com- 
pression using arithmetic coding, a form of entropy cod- 
ing which approaches optim ality, but re quires some care 
to implement efficiently (see Saidl (2004) for an excellent 
review). Arithmetic coding can be thought of as a refine- 
ment of Huffman coding, the major advantage being that 
it is able to assign codes of a non-integral number of bits. 
If a symbol appears with probability 0.1, it can be en- 
coded near to its optimal code length of — Iog2(0.1) « 3.3 
bits. Despite its power, it has historically seen much less 
use than Huffman coding, due in large part to fear of 
infringing on a number of patents that have now expired. 

Arithmetic coding is a particularly elegant means of com- 
pression in that it allows a complete separation between 
statistical modeling and encoding. In Quip, the same 
arithmetic coder is used to encode quality scores, read 
identifiers, nucleotide sequences, and alignment informa- 
tion, but with very different statistical models for each, 
which gives it a tremendous advantage over general- 
purpose compression algorithms that lump everything 
into a single context. Futhermore, all parts of our al- 
gorithm use adaptive modeling: parameters are trained 
and updated as data is compressed, so that an increas- 
ingly tight fit and higher compression is achieved on large 
files. 



Read Identifiers 



In the following sections we describe and evaluate Quip, 
a new lossless compression algorithm which leverages a 
variety of techniques to achieve very high compression 
over sequencing data of many types, yet remains effi- 
cient and practical. We have implement this approach 
in a open-source tool that is capable of compressing both 
BAM/SAM and FASTQ files, retaining ah information 
from the original file. 



The only requirement of read identifiers is that they 
uniquely identify the read. A single integer would do, 
but typically each read comes with a complex string con- 
taining the instrument name, run identifier, flow cell iden- 
tifier, and tile coordinates. Much of this information is 
the same for every read and is simply repeated, inflating 
the file size. 

To remove this redundancy, we use a form of delta encod- 
ing. A parser tokenizes the ID into separate fields which 
are then compared to the previous ID. Tokens that remain 
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the same from read to read (e.g. instrument name) can be 
compressed to a negligible amount of space — arithmetic 
coding produces codes of less than 1 bit in such cases. Nu- 
merical tokens are recognized and stored efficiently, either 
directly or as an offset from the token in the same posi- 
tion in previous read. Otherwise non-identical tokens are 
encoded by matching as much of the prefix as possible 
to the previous read's token before directly encoding the 
non-matching suffix. 

The end result is that read IDs, which are often 50 bytes 
or longer, are typically stored in 2-4 bytes. Notably, in 
reads produced from lUumina instruments, most parts of 
the ID can be compressed to consume almost no space; 
the remaining few bytes are accounted for by tile coor- 
dinates. These coordinates are almost never needed in 
downstream analysis, so removing them as a preprocess- 
ing step would shrink file sizes even further. The parser 
used is suitably general so that no change to the compres- 
sion algorithm would be needed. 



Quality Scores 

It has been previously noted that the quality score at a 
given position is hi ghly correlated with th e score at the 



preceding position (jKozanitis et al.l . l201iri . This makes 



a Markov chain a natural model, but unlike nucleotides, 
quality scores are over a much larger alphabet (typically 
41-46 distinct scores). This limits the order of the Markov 
chain: long chains will require a great deal of space and 
take a unrealistic amount of data to train. 

To reduce the number of parameters, we use an order- 
3 Markov chain, but coarsely bin the distal two posi- 
tions. In addition to the preceding three positions, we 
condition on the position within the read and a running 
count of the number large jumps in quality scores be- 
tween adjacent positions (where a "large jump" is defined 
as \qj — > 1), which allows reads with highly vari- 
able quality scores to be encoded using separate models. 
Both of these variables are binned to control the number 
of parameters. 



Nucleotide Sequences 

To compress nucleotide sequences, we adopt a very sim- 
ple model based on high-order Markov chains. The nu- 
cleotide at a given position in a read is predicted using 
the preceding twelve positions. This model uses more 
memory than traditional general-purpose compression al- 
gorithms (4-'^^ = 67, 108, 864 parameters are needed, each 
represented in 32 bits) but it is simple and extremely effi- 
cient (very little computation is required and run time is 
limited primarily by memory latency, as lookups in such 
a large table result in frequent cache misses). 

An order- 12 Markov chain also requires a very large 
amount of data to train, but there is no shortage with 
the datasets we wish to compress. Though less adept at 
compressing extremely short files, after compressing sev- 
eral million reads the parameters are tightly fit to the 
nucleotide composition of the dataset so that the remain- 
ing reads will be highly compressed. Compressing larger 
files only results in a tighter fit and higher compression. 



2.2 Reference-Based Compression 

We have also implemented lossless reference-based com- 
pression. Given aligned reads in SAM or BAM format, 
and the reference sequence to which they are aligned (in 
FASTA format), the reads are compressed preserving all 
information in the SAM/BAM file, including the header, 
read IDs, alignment information, and all optional fields al- 
lowed by the SAM format. Unaligned reads are retained 
and compressed using the Markov chain model. 

2.3 Assembly- Based Compression 

To complement the reference-based approach, we devel- 
oped an assembly-based approach which offers some of 
the advantages of reference-based compression, but re- 
quires no external sequence database and produces files 
which are entirely self-contained. We use the first (by de- 
fault) 2.5 million reads to assemble contigs which are then 
used in place of a reference sequence database to encode 
aligned reads compactly as positions. 
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Once contigs are assembled, read sequences are aligned 
using a simple "seed and extend" method: 12-mer seeds 
are matched using a hash table, and candidate alignments 
are evaluated using Hamming distance. The best (lowest 
Hamming distance) alignment is chosen, assuming it falls 
below a given cutoff, and the read is encoded as a position 
within the contig set. Roughly, this can be thought of as a 
variation on the Lempel-Ziv algorithm: as sequences are 
read, they are matched to previously observed data, or 
in this case, contigs assembled from previously observed 
data. These contigs are not explicitly stored, but rather 
reassembled during decompression. 

Traditionally, de novo assembly is extremely computa- 
tionally intensive. The most commonly used technique 
involves constructing a de Bruijn graph, a directed graph 
in which each vertex represents a nucleotide /c-mer present 
in the data for some fixed k (e.g., /c = 25 is a common 
choice). A directed edge from a /c-mer u to v occurs if 
and only if the {k — l)-mer suffix of u is also the prefix of 
V. In principle, given such a graph, an assembly can be 
produced by finding an Eulerian path, i.e., a path that fol- 



lows each edge in the graph exactly once (jPevzner et al. 



20011 ) ■ In practice, since NGS data has a non- negligible 
error rate, assemblers augment each vertex with the num- 
ber of observed occurrences of the /c-mer and leverage 
these counts using a variety of heuristics to filter out spu- 
rious paths. 

A significant bottleneck of the de Bruijn graph approach 
is building an implicit representation of the graph by 
counting and storing /c-mer occurrences in a hash table. 
The assembler implemented in Quip overcomes this bot- 
tleneck to a large extent by using a data structure based 
on the Bloom filter to count /c-mers. The Bloom filter 
()Blooml . Imdt) is a probabilistic data structure that rep- 
resents a set of elements extremely compactly, at the cost 
of elements occasionally colliding and incorrectly being 
reported as present in the set. It is probabilistic in the 
sense that these collisions occur pseudo-randomly, deter- 
mined by the size of the table and the hash functions 
chosen, but generally with low probability. 

The Bloom filter is generalized in the counting Bloom fil- 
ter, in which an arbitrar y coun t can be associated with 
each element ( Fan et al. . 2000l ). and further refined in 



the d -left counting Bloom filter (dlCBF) (jBonomi et al 
20061 ). which requires significantly less space than the al- 
ready quite space efficient counting Bloom filter. We base 
our assembler on a realization of the dlCBF. Because we 
use a probabilistic data structure, /c-mers are occasionally 
reported to have incorrect (inflated) counts. The assem- 
bly can be made less accurate by these incorrect counts, 
however a poor assembly only results in slightly increasing 
the compression ratio. Compression remains lossless re- 
gardless of the assembly quality, and in practice collisions 
in the dlCBF occur at a very low rate (this is explored in 
the results section). 

Given a probabilistic de Bruijn graph, we assemble con- 
tigs using a very simple greedy approach. A read se- 
quence is used as a seed and extended on both ends one 
nucleotide at a time by repeatedly finding the most abun- 
dant /c-mer that overlaps the end of the contig by /c — 1 
bases. More sophisticated heuristics have been developed, 
but we choose to focus on efficiency, sacrificing a degree 
of accuracy. 

Counting /c-mers efficiently with the help of Bloom fil- 
ters w as previously explored by iMelsted and Pritchard 
(|2011l ). who use it in addition, rather than in place, of 
a hash table. The Bloom filter is used as a "staging area" 
to store /c-mers occurring only once, reducing the memory 
required by the h ash table. Concurrently with our work. 
Pell et al.l poill ) have also developed a probabilistic de 
Bruijn graph assembler, but using a non-counting Bloom 
filter. While they demonstrate a significant reduction in 
memory use, unlike other de Bruijn graph assemblers, 
only the presence or absence of a k-mer is stored, not its 
abundance, which is essential information when the goal 
is producing accurate contigs. 



2.4 Metadata 

In designing the file format used by Quip, we included 
several useful features to protect data integrity. First, 
output is divided into blocks of several magabytes each. 
In each block a separate 64 bit checksum is computed for 
read identifiers, nucleotide sequences, and quality scores. 
When the archive is decompressed, these checksums are 
recomputed on the decompressed data and compared to 
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the stored checksums, verifying the correctness of the out- 
put. The integrity of an archived dataset can also be 
checked with the "quip -test" command. 

Apart from data corruption, reference-based compression 
creates the possibihty of data loss if the reference used for 
compression is lost, or an incorrect reference is used. To 
protect against this. Quip files store a 64 bit hash of the 
reference sequence, ensuring that the same sequence is 
used for decompression. To assist in locating the correct 
reference, the file name, and the lengths and names of 
the sequences used in compression are also stored and 
accessible without decompression. 

Additionally, block headers store the number of reads and 
bases compressed in the block, allowing summary statis- 
tics of a dataset to be listed without decompression using 
the "quip -list" command. 



3 Results 



3.1 Compression of Sequencing Data 

We compared Quip to three commonplace general- 
purpose compression algorithms: gzip, bzip2, and xz, 
as well as more recently deve loped domain-specific com- 
press ion algorithms: DSRC (Deorowicz and GrabowskiL 



20111 ) and Cramtools (jHsi-Yang Fritz et all. 1201 ll 
methods have been proposed (ITembe et al 



Kozanitis et al.l . l201ll : iBhola et al.l . 120111 ). but without 



Other 
20inl: 



publicly available software we were unable independently 
evaluate them. We have also restricted our focus to 
lossless compression, and have not evaluated a number 



of promising loss y methods (|Hsi-Yang Fritz et al.l . 12011 



Wan et al.ll2011[ ). nor me thods only c apabl e of compress- 



ing nucleotide sequences (jCox et al.L 120121 ) . We invoked 
Quip in three ways: using only statistical compression 
("quip"), using reference-based compression ("quip -r"), 
and finally with assembly-based compression ("quip -a"). 
Table [T] gives an overview of the methods evaluated. 

We acquired six datase ts (Table [2t fr om t he Se- 



quence Read Archive (jLeinonen et al.l 120111) . rep 



of next-generation sequencing. Genome and ex- 
ome seq uencing data was taken from 1 000 G enomes 
Project |The 1000 Genomes Consortiuml . boiol ). total 



and mRNA data from the lUumina BodyMap 2.0 dataset 



(jAsmann et al.l . l2012f) . Runx2 ChlP-Seq p erformed in a 



prostate cancer cell line (jLittle et al.l . 120111) , and metage 



nomic DNA sequencing of biofilm found in the extremely 
acidic drainage fro m the Richmond Mine in California 



()Denef et al.l . I2OIOI) 



resenting a broad sampling of recent applications 



The Sequence Read Archive provides data in its own SRA 
compression format which we also evaluate here. Pro- 
grams for working with SRA files are provided in the 
SRA Toolkit, but a convenient means of converting from 
FASTQ to SRA is not, so we have not measured compres- 
sion time and memory in this case. 

In the case of reference-based compression implemented 
in Cramtools and Quip, we aligned reads from all but 
the metagenomic data to the GRC h37/hgl9 assembly o f 
the human genome using GSNAP (IWu and Nacul . I2010I) 
generating a BAM file. Splice-junction prediction was 
enabled for the two RNA-Seq datasets. In the case of 
multiple alignments, we removed all but the the primary 
(i.e., highest scoring). Quip is able to store secondary 
alignments, but the version of Cramtools we evaluated 
was not. When the purpose of alignment in simply com- 
pactly archiving the the reads, secondary alignments have 
no purpose and merely inflate the flle size, but in down- 
stream analysis they are often extremely informative and 
retaining them may be desirable in some cases. 

It is important to note pure lossless compression is 
not the intended goal of Cramtools. Though we in- 
voked Cramtools with options to include all quality 
scores ( — capture-all-quality-scores), optional tags 
( — capture-all-tags), and retain unaligned reads ( 
— include-unmapped-reads), it is unable to store the 
original read IDs. This puts it at advantage when com- 
paring compressed file size, as all other programs com- 
pared were entirely lossless, but as the only other avail- 
able reference-based compression method, it is a useful 
comparison. 

With each method we compressed then decompressed 
each dataset on a server with a 2.8Ghz Intel Xeon proces- 
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sor and 64 GB of memory. In addition to recording the 
compressed file size (Figure [J), we measured wall clock 
run-time (Figure [S]) and maximum memory usage (Table 
[3|) using the Unix time command. Run time was nor- 
malized to the size of the original FASTQ file, measuring 
throughput in megabytes per second. 

For the reference-based compressors, time spent aligning 
reads to the genome was not included, though it required 
up to several hours. Furthermore, while Quip is able to 
compress aligned reads in an uncompressed SAM file and 
then decompress back to SAM (or FASTQ), the input and 
output of Cramtools is necessarily a BAM file, and so we 
also use BAM files as input and output in our benchmarks 
of "quip -r" . This conflates measurements of compression 
and decompression time: since BAM files are compressed 
with zlib, compression times for the reference-based com- 
pressors include time needed to decompress the BAM 
file, and decompression times include re-compressing the 
output to BAM. Consequently, decompression speeds in 
particular for both Cramtools and reference-based Quip 
( "quip -r" ) appear significantly lower than they otherwise 
might. 

In the single-genome samples, we find that reference- 
based compression using Quip consistently results in the 
smallest file size (i.e., highest compression), yet even with- 
out a reference sequence, compression is quite high, even 
matching the reference-based approach used by Cram- 
tools in two of the datasets. Assembly-based compres- 
sion provides almost no advantage in the single-genome 
ChlP-Seq, Whole Genome, and Exome datasets. Because 
we assembly only several million reads (2.5 million, by 
default), coverage is only enough to assemble representa- 
tives of certain families of repeats in these datasets. 

In the RNA-Seq datasets, more is assembled (approxi- 
mately 35% and 70% of the reads are aligned to assembled 
contigs in the total RNA-Seq and mRNA-Seq datasets, 
respectively), but this only provides a small increase in 
the compression overall. Mostly probably, the aligned 
reads in these datasets are relatively low-entropy, com- 
posed largely of protein-coding sequence, and would be 
encoded only moderately less compactly with the Markov 
chain method. 



Nevertheless, assembly-based compression does provide a 
significant advantage in the metagenomic dataset. Nearly 
95% of reads align to assembled contigs, resulting in a 
large reduction in compressed file size. Though this may 
be the result of low species diversity and may not ex- 
tend to all metagenomic datasets, it shows that assembly- 
based compression is beneficial in certain cases. Though 
it comes at some cost, we are able to perform assembly 
with exceedingly high efficiently: memory usage is in- 
creased only by 400 MB and compression time is faster 
than all but DSRC and assembly-free Quip even while 
assembling 2.5 million reads and aligning millions more 
reads to the assembled contigs. 

With all invocations of Quip, the size of the resulting 
file is dominated, sometimes overwhelmingly, by quality 
scores (Figure [2]). There is not a direct way to mea- 
sure this using the other methods evaluated, but this 
result suggests that increased compression of nucleotide 
sequences will result in diminishing returns. 



Average Memory Usage (MB) 



Method 


Compression 


Decompression 


gzip 


0.8 


0.7 


bzip2 


7.0 


4.0 


xz 


96.2 


9.0 


sra 


NA 


46.9 


dsrc 


28.3 


15.1 


cramtools 


6803.8 


6749.3 


quip 


412.1 


412.9 


quip -a 


823.0 


794.1 


quip -r 


1710.0 


1717.2 



Table 3: Maximum memory usage was recorded for 
each program and dataset. We list the average across 
datasets of these measurements. For most of the pro- 
grams, memory usage varied by less that 10% across 
datasets and well summarized by the mean, with the 
exception of SRA. Decompression used less that 60 
MB except in the human exome sequencing dataset, 
in which over 1.4 GB was used. We report the mean 
with this outlier removed. 
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Program 
gzip (1.4) 
bzip2 (1.0.6) 
xz (5.0.3) 
sra (2.1.10) 
dsrc (0.3) 
cramtools (0.85- 
quip (1.1.0) 
quip -a (1.1.0) 
quip -r (1.1.0) 



b51) 



Input 

Any 

Any 

Any 

FASTQ 

FASTQ 

BAM 

FASTQ, SAM, BAM 
FASTQ, SAM, BAM 
SAM, BAM 



Methods 
Lempel-Ziv 

Burrows- Wheeler and Huffman coding 

Lempel-Ziv with arithmetic coding 

No pubUshed description 

Lempel-Ziv and Huffman coding. 

Reference-based compression 

Statistical modeling with arithmetic coding 

Quip with assembly-based compression 

Quip with reference-reference based compression 



Table 1: Methods evaluated in the results section. All methods compared are lossless, with the exception of 
Cramtools which does not preserve the original read identifiers. 



Accession Number Description Source Read Length 

SRR400039 Whole Genome The 1000 Genomes Project fThe 1000 Genomes Consortium . 2010) 101 

SRR125858 Exome The 1000 Genomes Project fThe 1000 Genomes Consortium . 2010) 76 

SRR372816 Runx2 ChlP-Seq Little, et al fL ittle et al . 2011) 50 

ERR030867 Total RN A The BodvMap 2.0 Project. Asmann. et al fAsmann et al.. 2012) 100 

ERR030894 mRNA The BodvMap 2.0 Project, Asmann. et al fAsmann et al., 2012) 75 

SRR359032 Metagenomic DNA Denef, et al fPenef et al. , 2010) 100 



Table 2: We evaluated compression on a single lane of sequencing data taken from a broad collection of seven 
studies. Except for the metagenomic data, each was generated from human samples. Uncompressed file sizes are 
shown in gigabytes. 



Whole Genome 



Exome 



Runx2 ChlP-Seq 



Total RNA 




mRNA 



Metagenomic DNA 




0.2 0.3 0.0 0.1 0.2 
Compression Ratio 



0.3 0.0 0.1 0.2 




Figure 1: One lane of sequencing data from each of six publicly available datasets (Table\^ was compressed using a 
variety of methods (Table{^. The size of the compressed data is plotted in proportion to the size of uncompressed 
data. Note that all methods evaluated are entirely lossless with the exception of Cramtools, marked in this plot 
with an asterisk, which does not preserve read IDs, giving it some advantage in these comparisons. Reference- 
based compression methods, in which an external sequence database is used, were not applied to the metagenomic 
dataset for lack of an obvious reference. These plots are marked "N/A " . 
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Figure 2: After compressing with Quip using the base algorithm (labeled "quip"), assembly-based compression 
("quip -a"), and reference-based compression ("quip -r"), we measure the size of read identifiers, nucleotide 
sequences, and quality scores in proportion to the total compressed file size. Reference-based compression ("quip 
-r") was not applied to the metagenomic data. 
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Figure 3: The wall clock run-time of each evaluation was recorded using the Unix time command. Run-time 
is normalized by the size of the original FASTQ file to obtain a measure of compression and decompression 
throughput, plotted here in megabytes per second. Compression speed is plotted to the left of the zero axis, 
and decompression speed to the right. There is a not a convenient means of converting from FASTQ to SRA, 
so compression time is not measured for SRA. Additionally, reference-based methods are not applied to the 
metagenomic data. In the case of reference-based compression implemented in "cramtools" and "quip -r", input 
and output is in the BAM format, so the speeds plotted here include time needed to decompress and compress 
the BAM input and output, respectively. 
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3.2 Characteristics of the d-left Counting Bloom Filter 



0.008 



Though our primary goal is efficient compression of se- 
quencing data, the assembly algorithm we developed to 
achieve this is of independent interest. Only very recently 
has the idea of using probabilistic data structures in as- 
sembly been breached, and to our knowledge, we are the 
first to build a functioning assembler using any version 
of the counting Bloom filter. Data structures based on 
the Bloom filter make a trade-off between space and the 
probability of inserted elements colliding. Tables can be 
made arbitrarily small at the cost of increasing the num- 
ber of collisions. To elucidate this trade-off we performed 
a simple simulation comparing the dlCBF to a hash table. 

Comparisons of data structures are notoriously sen- 
sitive to the specifics of the implementation. To 
perform a meaningful benchmark, we compared our 
dlCBF implementation to the sparsehash library 
(http: //code. google. com/p/sparsehash/f), an open- 



source hash table implementation with the expressed 
goal of maximizing space efficiency. Among many 
other uses, it is the c ore da ta structure in t he ABySS 
dSimpson and Durbinl . l201l[ ) and PASHA (|Liu et al 
20111 ) assemblers. 



We randomly generated 10 million unique, uniformly dis- 
tributed 25-mers and inserted them into a hash table, 
allocated in advance to be of minimal size while avoiding 
resizing and rehashing. We repeated this with dlCBF ta- 
bles of increasing sizes. Upon insertion, a collision occurs 
when the hash functions computed on the inserted k-mei 
collide with a previously inserted /c-mer. An insertion 
may also fail when a fixed size table is filled to capacity 
and no empty cells are available. For simplicity, we count 
these occurrences also as collisions. Wall clock run-time 
and maximum memory usage were both recorded using 
the Unix time command. 

We find that with only 20% of the space of the hash table 
the dlCBF accrues a collision rate of less than 0.001 (Fig- 
ure SJ. While the hash table performed the 10 million 
insertions in 7.34 seconds, it required only 4.48 seconds 
on average for the dlCBF to do the same, with both sim- 
ulations run on a 2.8Ghz Intel Xeon processor. Table 
size did not greatly affect the runtime of the dlCBF. The 



0.000 



Memory 
(in proportion to sparsehasli) 

Figure 4: The trade-off between memory usage and 
false positive rate in the dlCBF is evaluated by in- 
serting 10 million unique 25-mers into tables of in- 
creasing size. Memory usage is reported as the pro- 
portion of the memory used by a memory efficient 
hash table to do the same. 



assembler implemented in Quip uses a fixed table size, 
conservatively set to allow for 100 million unique k-mevs 
with a collision rate of approximately 0.08% in simula- 
tions. 

Though the authors of sparsehash claim only a 4 bit over- 
head for each entry, and have gone to considerably effort 
to achieve such efficiency, it still must store the k-raev it- 
self, encoded in 64 bits. The dlCBF avoids this, storing 
instead a 14-bit "fingerprint" , or hash of a /(-mer, result- 
ing in the large savings we observe. Of course, a 25-mer 
cannot be uniquely identified with 14 bits. False positives 
are thus introduced, yet they are kept at a very low rate 
by the d-left hashing scheme. Multiple hash functions are 
used under this scheme, so that multiple hash function 
must collide to result in a collision, an infrequent event if 
reasonably high-quality hash functions are chosen. 



A previous analysis of the dlCBF by IZhang et al.l (|2009l ) 

compared it to two other variations of the counting Bloom 
filter and concluded that "the dlCBF outperforms the 
others remarkably, in terms of both space efficiency and 
accuracy." Overall, this data structure appears particu- 
larly adept for high efficiency de novo assembly. 
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4 Discussion 

Compared to the only other pubhshed and freely-available 
reference-based compressor, Cramtools, we see a signifi- 
cant reduction in compressed file size, despite read identi- 
fiers being discarded by Cramtools and retained by Quip. 
This is combined with dramatically reduced memory us- 
age, comparable run-time (slightly faster compression 
paired with varying relative decompression speeds), and 
increased flexibility (e.g. Quip is able to store multiple 
alignments of the same read and does not require that the 
BAM file be sorted or indexed). Conversely, Cramtools 
implements some potentially very useful lossy methods 
not provided by Quip. For example, a number of options 
are available for selectively discarding quality scores that 
are deemed unnecessary, which can significantly reduce 
compressed file sizes. 

We find that assembly-based compression offers only a 
small advantage in many datasets. Because we assemble 
a relatively small subset of the data, coverage is too low 
when sequencing a large portion of a eukaryotic genome. 
In principle, the assembler used could be scaled to handle 
exponentially more reads, but at the cost of being in- 
creasingly computationally intensive. In such cases, the 
reference-based approach may be more appropriate. Yet, 
in the metagenomic dataset we evaluate, where there is 
no obvious reference sequence, nucleotide sequences are 
reduced in size by over 40% (Figure [2]). While limited in 
scope, for certain projects assembly-based compression 
can prove to be invaluable, and the algorithm devised 
here makes it a tractable option. 

The Lempel-Ziv algorithm, particularly as implemented 
in gzip/libz has become a reasonable default choice for 
compression. The zlib library has matured and stabilized 
over the course of two decades and is widely available. 
The BAM and Goby formats both use zlib for compres- 
sion, and compressing FASTQ files with gzip is still com- 
mon practice. Despite its ubiquity, our benchmarks show 
that it is remarkably poorly suited to NGS data. Both 
compression ratio and compression time were inferior to 
the other programs evaluated. For most purposes, the 
gains in decompression time do not make up for its short- 
comings. With the more sophisticated variation imple- 



mented in xz, compression is improved but at the cost of 
tremendously slow compression. 

Our use of high-order Markov chain models and de novo 
assembly results in a program that uses significantly more 
memory than the others tested, with the exception of 
Cramtools. Though limiting the memory used by a gen- 
eral purpose compression program enables it to be used on 
a wider variety of systems, this is less important in this 
domain-specific application. Common analysis of next- 
generation sequencing data, whether it be alignment, as- 
sembly, isoform quantification, peak calling, or SNP call- 
ing all require significant computational resources. Tar- 
geting low-memory systems would not be of particular 
benefit: next-generation sequencing precludes previous- 
generation hardware. 

Though memory consumption is not a top priority, run- 
time is important. Newer instruments like the HiSeq 2000 
produce far more data per lane than previously possible. 
And, as the cost of sequencing continues to decrease, ex- 
periments will involve more conditions, replicates, and 
timepoints. Quip is able to compress NGS data at three 
times the speed of gzip, while performing de novo as- 
sembly of millions of reads, and up to five times as fast 
without the assembly step. Only DSRC is faster, but 
with consistently lower compression. In addition, our 
reference-based compression algorithm matches the speed 
of cramtools but with substantially better lossless com- 
pression. 

As illustrated in Figure [H the size of compressed NGS 
data is dominated by quality scores. More sophisticated 
compression of nucleotide sequences may be possible but 
will result in only minor reductions to the overall file size. 
The largest benefit would be seen by reducing quality 
scores. While it is easy to reduce the size of quality data 
by coarse binning or other lossy methods, it is very hard 
to determine the effect such transformations will have on 
downstream analysis. Future work should concentrate on 
studying lossy quality score compression, strictly guided 
by minimizing loss of accuracy in alignment, SNP calling, 
and other applications. Quality scores are encoded in 
Quip using an algorithm that is suitably general so that 
lossy transformations (e.g., binning) can be automatically 
exploited, and can be treated as an optional preprocessing 
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Other aspects of the algorithm presented here can be im- 
proved, and will be with future versions of the software. 
A largo body of work exists exploring heuristics to im- 
prove the the quality of de novo assembly, however the 
algorithm in Quip uses a very simple greedy approach. 
Assembly could likely be improved without greatly re- 
ducing efficiency by exploring more sophisticated meth- 
ods. We currently perform no special handling of paired- 
end reads, so that mates arc compressed independently of 
each other. In principle some gains to compression could 
be made by exploiting pair information. We also intend to 
implement parallel compression and decompression. This 
is non-trivial, but quite possible and worthwhile given the 
abundance of data and ubiquity of multi-core processors. 

Combining reference- based and assembly-based tech- 
niques, with carefully tuned statistical compression, the 
algorithm presented in Quip probes the limit to which 
next-generation sequencing data can be compressed loss- 
lessly, yet remains efficient enough be a practical tool 
when coping with the deluge of data that biology research 
is now presented with. 
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